Context-aware and Scale-insensitive Temporal Repetition Counting
Temporal repetition counting aims to estimate the number of cycles of a given
repetitive action. Existing deep learning methods assume repetitive actions are
performed in a fixed time-scale, which is invalid for the complex repetitive
actions in real life. In this paper, we tailor a context-aware and
scale-insensitive framework, to tackle the challenges in repetition counting
caused by the unknown and diverse cycle-lengths. Our approach combines two key
insights: (1) Cycle lengths of different actions are unpredictable and
require large-scale searching, but once a coarse cycle length is determined,
the variety between repetitions can be overcome by regression. (2) Determining
the cycle length cannot rely only on a short fragment of video; it requires
contextual understanding. The first point is implemented by a coarse-to-fine cycle
refinement method. It avoids the heavy computation of exhaustively searching
all the cycle lengths in the video, and, instead, it propagates the coarse
prediction for further refinement in a hierarchical manner. Second, we propose
a bidirectional cycle length estimation method for a context-aware prediction.
It is a regression network that takes two consecutive coarse cycles as input,
and predicts the locations of the previous and next repetitive cycles. To
support training and evaluation in temporal repetition counting, we
construct a new benchmark, the largest so far, which contains 526 videos with diverse
repetitive actions. Extensive experiments show that the proposed network
trained on a single dataset outperforms state-of-the-art methods on several
benchmarks, indicating that the proposed framework is general enough to capture
repetition patterns across domains. Comment: Accepted by CVPR202
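The coarse-to-fine idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's method: the paper refines cycles with a learned regression network on video, whereas the sketch below swaps in a simple self-difference score on a 1-D signal. The function names, the `coarse_step` parameter, and the synthetic signal are all assumptions made for illustration.

```python
def periodicity_score(signal, lag):
    """Mean absolute difference between the signal and a copy shifted by `lag`.

    Lower values mean the signal repeats more closely at this lag.
    """
    n = len(signal) - lag
    return sum(abs(signal[i] - signal[i + lag]) for i in range(n)) / n


def coarse_to_fine_cycle_length(signal, min_len, max_len, coarse_step):
    """Estimate a cycle length without scoring every candidate lag.

    Hypothetical helper: a coarse pass scores every `coarse_step`-th lag,
    then a fine pass scores only the lags around the best coarse candidate.
    """
    # Coarse pass: sparse sweep over candidate cycle lengths.
    coarse_lags = range(min_len, max_len + 1, coarse_step)
    best = min(coarse_lags, key=lambda lag: periodicity_score(signal, lag))

    # Fine pass: dense search restricted to the neighbourhood of `best`.
    lo = max(min_len, best - coarse_step + 1)
    hi = min(max_len, best + coarse_step - 1)
    return min(range(lo, hi + 1), key=lambda lag: periodicity_score(signal, lag))


# Synthetic sawtooth that repeats every 20 samples (assumed test signal).
signal = [i % 20 for i in range(200)]
cycle = coarse_to_fine_cycle_length(signal, min_len=6, max_len=30, coarse_step=4)
print(cycle)                  # 20 for this synthetic signal
print(len(signal) // cycle)   # 10 repetitions
```

In this sketch the search range is kept below twice the true cycle length, because integer multiples of the cycle score equally well and a wider range could return a multiple instead.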
SINet: A Scale-insensitive Convolutional Neural Network for Fast Vehicle Detection
Vision-based vehicle detection approaches have achieved remarkable success in
recent years with the development of deep convolutional neural networks (CNNs).
However, existing CNN-based algorithms suffer from the problem that
convolutional features are scale-sensitive in object detection, while
traffic images and videos commonly contain vehicles with a large variance of
scales. In this paper, we delve into the source of scale sensitivity, and
reveal two key issues: 1) existing RoI pooling destroys the structure of small
scale objects, 2) the large intra-class distance for a large variance of scales
exceeds the representation capability of a single network. Based on these
findings, we present a scale-insensitive convolutional neural network (SINet)
for fast detection of vehicles with a large variance of scales. First, we present
a context-aware RoI pooling to maintain the contextual information and original
structure of small scale objects. Second, we present a multi-branch decision
network to minimize the intra-class distance of features. These lightweight
techniques add no extra time complexity yet markedly improve detection
accuracy. The proposed techniques can be integrated into any deep network
architecture while keeping it trainable end-to-end. Our SINet achieves
state-of-the-art performance in terms of accuracy and speed (up to 37 FPS) on
the KITTI benchmark and a new highway dataset, which contains a large variance
of scales and extremely small objects. Comment: Accepted by IEEE Transactions on Intelligent Transportation Systems
(T-ITS
RGB-D Visual Saliency Detection Algorithm Based on Information Guided and Multimodal Feature Fusion
With the development of information technology and the popularization of electronic devices, images and videos have become important forms and carriers of information in daily life. Efficiently mining valuable information from massive data has become an important aspect of current computer vision research. Salient object detection, which is related to human visual attention, is gradually being applied in computer processing. However, current color-depth (RGB-D) models still make insufficient use of depth cues, and there is still significant room for improvement in image quality. To address this, an improved color-depth detection model based on information guidance and multi-feature fusion is proposed, and an absorbing Markov model is introduced to optimize the guidance of low-level, middle-level, and high-level saliency maps, capturing different feature information. The network is then guided progressively through feature encoding, multi-scale and multi-attention models, and attention refinement mechanisms. Experimental analysis showed that the average classification accuracy improvement of the proposed fusion model reached 5.23%, and its error value was less than 0.1. Its effectiveness on all four quantitative indicators exceeded 92%. The system's detection response rate exceeded 93%; it is limited by the target object, which results in a decrease in accuracy. This algorithm can provide a reference and a means for target localization and recognition and for virtual scene detection.
GDFace: Gated deformation for multi-view face image synthesis
Photorealistic multi-view face synthesis from a single image is an important but challenging problem. Existing methods mainly learn a texture mapping model from the source face to the target face. However, they fail to consider the internal deformation caused by pose changes, leading to unsatisfactory synthesis results for large pose variations. In this paper, we propose a Gated Deformable Face Synthesis Network that models the deformation of faces to aid the synthesis of the target face image. Specifically, we propose a dual network that consists of two modules. The first module estimates the deformation between two views in the form of convolution offsets according to the input and target poses. The second module leverages the predicted deformation offsets to create the target face image. In this way, pose changes are explicitly modeled in the face generator to cope with geometric transformation, by adaptively focusing on pertinent regions of the source image. To compensate for offset estimation errors, we introduce a soft-gating mechanism that enables adaptive fusion between deformable features and primitive features. Extensive experimental results on five widely used benchmarks show that our approach performs favorably against state-of-the-art methods on multi-view face synthesis, especially for large pose changes.
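The soft-gating fusion mentioned in the abstract above can be sketched as an element-wise blend of two feature vectors. The abstract does not give the formula, so the sigmoid gate and the convex combination below are assumptions, as are the function and parameter names; in a real network the gate logits would be predicted by a learned layer.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_gate_fuse(deformable, primitive, gate_logits):
    """Blend two feature vectors element-wise with a soft gate.

    Assumed form: out = g * deformable + (1 - g) * primitive, with
    g = sigmoid(gate_logit). A gate near 1 trusts the deformable feature;
    a gate near 0 falls back to the primitive (undeformed) feature.
    """
    fused = []
    for d, p, z in zip(deformable, primitive, gate_logits):
        g = sigmoid(z)
        fused.append(g * d + (1.0 - g) * p)
    return fused


# A zero logit gives g = 0.5, so the output is the mean of the two inputs.
print(soft_gate_fuse([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))  # [2.0, 3.0]
```

Under this assumed form, an unreliable offset estimate can be suppressed by driving its gate toward 0, which is one plausible reading of how the gate compensates for offset estimation errors.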